BI Final Project-Douban movie reviews analysis

Student Name_Student ID: Hu Lingzhen_1155146522; Ye Kaixin_1155153576; Zhou Yanxuting_1155151493; Wang Ziwei_1155147399; Hu Yuxuan_1155153427

1.Pre-processing

We added several preprocessing steps to clean the data: we removed duplicated comments and built a movie-specific stop-word list.
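The two cleaning steps can be sketched as follows. This is a minimal illustration, not the project's actual code: the sample comments, the contents of `MOVIE_STOP_WORDS`, and the helper names are hypothetical, and the comments are assumed to be already word-segmented and space-separated.

```python
def deduplicate(comments):
    """Drop exact duplicate comments, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(comments))


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical movie stop-word list: words that appear in almost every review
# and therefore carry little sentiment information.
MOVIE_STOP_WORDS = {"电影", "片子", "导演", "的", "了"}

# Made-up, pre-segmented comments; the second one duplicates the first.
comments = [
    "这部 电影 的 剧情 很 感人",
    "这部 电影 的 剧情 很 感人",
    "导演 的 节奏 太 拖沓 了",
]

unique_comments = deduplicate(comments)
cleaned = [remove_stop_words(c.split(), MOVIE_STOP_WORDS) for c in unique_comments]
print(cleaned)
```

Using `dict.fromkeys` keeps the first copy of each comment and preserves the input order, which a plain `set` would not guarantee.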

2.Using tf-idf features as independent variables to predict the sentiment

We first tried to build LDA and LSA topic models to make predictions (the same method as in project 3), but the results were not ideal. With either model, the predictions were about as good as random guessing. Therefore, after referring to other NLP sentiment prediction projects, we switched to using tf-idf features as independent variables to build our model. We trained several classification models, printed their classification reports, and drew their ROC plots. We chose the best model according to both the F1-score and the AUC score.
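A minimal sketch of this comparison with scikit-learn is shown below. The toy English reviews and labels are invented for illustration, only three of the classifiers are included, and ranking by the pair (F1, AUC) is just one assumed way to combine the two scores; the notebook's actual data, settings, and selection rule may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up stand-in reviews; 1 = positive, 0 = negative.
texts = [
    "great story moving ending", "wonderful acting great cast",
    "beautiful music touching plot", "funny warm lovely film",
    "excellent script great director", "moving story wonderful music",
    "lovely characters great humor", "touching ending excellent acting",
    "boring plot weak ending", "terrible acting bad script",
    "slow pacing dull story", "awful dialogue boring film",
    "weak characters bad pacing", "dull music terrible plot",
    "bad ending awful acting", "boring script slow film",
]
labels = [1] * 8 + [0] * 8

# Stratify so that the small test set contains both classes (required for AUC).
train_txt, test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit tf-idf on the training split only, then reuse the vocabulary on the test split.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_txt)
X_test = vectorizer.transform(test_txt)

models = {
    "Multinomial NB": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # AUC needs a continuous score: class-1 probability when the model offers it,
    # otherwise the signed distance to the decision boundary.
    if hasattr(model, "predict_proba"):
        y_score = model.predict_proba(X_test)[:, 1]
    else:
        y_score = model.decision_function(X_test)
    results[name] = {
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
    }

# Assumed selection rule: highest F1 first, AUC as the tie-breaker.
best_name = max(results, key=lambda n: (results[n]["f1"], results[n]["auc"]))
print(results)
print("best model:", best_name)
```

`sklearn.metrics.classification_report(y_test, y_pred)` prints per-class precision, recall, and F1 for the same predictions, and `sklearn.metrics.roc_curve(y_test, y_score)` returns the points needed for a ROC plot.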

Multinomial NB

Logistic Regression

Linear SVM

Perceptron

MLP Classifier

GradientBoost

AdaBoost

3.Build the LSA and LDA topic models for the reviews

We use two models, LDA and LSA, to analyze the review documents. For each model, we fit it on the tf-idf-weighted corpus together with the dictionary and extract several topics. We then use coherence values to find the optimal number of topics. Coherence values help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference, so we choose the number of topics with the highest coherence score. Finally, we print the topics and the score of each topic for the reviews.
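To make the role of coherence concrete, here is a small pure-Python sketch of one common coherence measure, UMass, together with the "pick the topic number with the highest score" step. The choice of UMass is an assumption, since the measure used in this project is not stated above. The documents and the hand-written candidate topics are invented and stand in for the top words of fitted LDA or LSA models.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic: sum of log((D(w_i, w_j) + 1) / D(w_j)) over
    word pairs with j < i, where D counts the documents containing the word(s).
    Assumes every top word occurs in at least one document."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score


# Made-up tokenized reviews.
documents = [
    ["plot", "story", "ending"],
    ["plot", "story", "twist"],
    ["actor", "acting", "cast"],
    ["actor", "acting", "role"],
]

# Hypothetical top words per topic for two candidate topic numbers.
candidate_topics = {
    2: [["plot", "story"], ["actor", "acting"]],
    3: [["plot", "actor"], ["story", "cast"], ["acting", "twist"]],
}

# Average the per-topic coherence for each candidate and keep the best one.
mean_coherence = {
    k: sum(umass_coherence(t, documents) for t in topics) / len(topics)
    for k, topics in candidate_topics.items()
}
best_k = max(mean_coherence, key=mean_coherence.get)
print(mean_coherence, best_k)
```

Topics whose top words co-occur in the same reviews score higher than topics that mix unrelated words, so in this toy setup the two-topic candidate wins.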

LDA-choose the topic number

We first choose the optimal topic number.

LDA-build the model

LSA-choose the topic number

LSA-build the model

4.Now, let's use a new film 'Frozen' to test our model!